# Image-Text Generation
Qwen Qwen2.5 VL 72B Instruct GGUF
Other
A quantized version of the Qwen2.5-VL-72B-Instruct multimodal large language model for image-text-to-text tasks, offered at quantization levels ranging from high precision down to low memory footprints.
Image-to-Text English
bartowski
1,336
1
Jedi 7B 1080p
Apache-2.0
Jedi-7B-1080p is a multimodal model built on Qwen2.5-VL-7B-Instruct, supporting joint processing of images and text for vision-language tasks.
Image-to-Text
Safetensors English
xlangai
239
2
UI TARS 1.5 7B 4bit
Apache-2.0
UI-TARS-1.5-7B-4bit is a 4-bit quantized multimodal model for image-text-to-text tasks, with English language support.
Image-to-Text
Transformers Supports Multiple Languages

mlx-community
184
1
Internvl3 1B Hf
Other
InternVL3 is an advanced series of multimodal large language models, demonstrating exceptional multimodal perception and reasoning capabilities, supporting image, video, and text inputs.
Image-to-Text
Transformers Other

OpenGVLab
1,844
2
Barcenas 4b
A multimodal model fine-tuned from google/gemma-3-4b-it on high-quality data spanning mathematics, programming, science, and puzzle solving.
Image-to-Text
Transformers English

Danielbrdz
15
2
Gemma 3 4b It GPTQ 4b 128g
An INT4-quantized version of the gemma-3-4b-it model that significantly reduces storage and compute requirements.
Image-to-Text
Transformers

ISTA-DASLab
502
2
Qwen2.5 VL 7B Instruct Gptqmodel Int8
MIT
A GPTQ-INT8 quantized version of the Qwen2.5-VL-7B-Instruct vision-language model.
Image-to-Text
Transformers Supports Multiple Languages

wanzhenchn
101
0
Gemma 3 12b It Qat Q4 0 Unquantized
Gemma 3 is Google's lightweight open-source multimodal model series based on Gemini technology, supporting text and image inputs with text outputs. The 12B version undergoes instruction tuning and quantization-aware training (QAT), making it suitable for deployment in resource-limited environments.
Image-to-Text
Transformers

google
1,159
10
Vora 7B Instruct
VoRA is a 7B-parameter vision-language model focused on image-text-to-text tasks.
Image-to-Text
Transformers

Hon-Wong
154
12
Vora 7B Base
VoRA is a 7B-parameter vision-language model capable of processing image and text inputs to generate text outputs.
Image-to-Text
Transformers

Hon-Wong
62
4
Qwen2.5 VL 7B Instruct Q4 K M GGUF
Apache-2.0
This is the GGUF-quantized version of the Qwen2.5-VL-7B-Instruct model, suitable for multimodal tasks with both image and text inputs.
Image-to-Text English
PatataAliena
69
1
Qwen2.5 VL 7B Instruct GGUF
Apache-2.0
Qwen2.5-VL-7B-Instruct is a multimodal vision-language model that supports image understanding and text generation tasks.
Image-to-Text English
Mungert
17.10k
10
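For GGUF repositories like this one, a common first step is to pull a single quantized file rather than clone the whole repo. Below is a minimal sketch using the huggingface_hub client; the repository id is inferred from this entry's title and owner, so verify it (and the available filenames) before running.

```python
from huggingface_hub import hf_hub_download, list_repo_files

# Repository id inferred from the entry's title and owner (verify before use).
repo_id = "Mungert/Qwen2.5-VL-7B-Instruct-GGUF"

# List the quantized files in the repo, then download one of them.
gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]
print(gguf_files)

local_path = hf_hub_download(repo_id=repo_id, filename=gguf_files[0])
print(local_path)  # hand this path to a GGUF runtime such as llama.cpp
```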
Heron NVILA Lite 1B
Apache-2.0
A Japanese vision-language model built on the NVILA-Lite architecture, supporting image-text interaction in both Japanese and English.
Image-to-Text
Safetensors Supports Multiple Languages
turing-motors
460
2
Qwen.qwen2.5 VL 72B Instruct GGUF
Qwen2.5-VL-72B-Instruct is a large-scale vision-language model developed by the Tongyi Qianwen team, supporting multimodal understanding and generation of images and text.
Image-to-Text
DevQuasar
281
0
Gemma 3 4b Pt Qat Q4 0 Gguf
Gemma 3 is a lightweight open model series launched by Google, built on the same technology as Gemini, supporting multimodal input and text output.
Image-to-Text
google
912
16
Chameleon 7b
Other
A 7B-parameter multimodal model from Meta's Chameleon series, supporting image-text-to-text tasks.
Large Language Model
FriendliAI
24
1
Toriigate V0.4 7B GGUF
Apache-2.0
A static quantized version of ToriiGate-v0.4-7B, suitable for multimodal, vision, and image-text-to-text tasks.
Image-to-Text
Transformers English

mradermacher
668
0
Internvl2 5 4B AWQ
MIT
InternVL2_5-4B-AWQ is the AWQ quantized version of InternVL2_5-4B using autoawq, supporting multilingual and multimodal tasks.
Image-to-Text
Transformers Other

rootonchair
29
2
Gemma 3 4b It
Gemma is a lightweight, advanced open model series launched by Google, built on the same research and technology as Gemini. Gemma 3 is a multimodal model capable of processing both text and image inputs to generate text outputs.
Image-to-Text
Transformers

google
608.22k
477
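As a rough illustration of how an entry like this, tagged Image-to-Text with Transformers support, is typically driven, here is a minimal sketch using the transformers image-text-to-text pipeline. It assumes a transformers release recent enough to include Gemma 3 and that pipeline task; the image URL and prompt are placeholders.

```python
from transformers import pipeline

# Minimal sketch: send an image plus a text prompt through a chat-style multimodal model.
# Assumes a recent transformers release with Gemma 3 support and the image-text-to-text pipeline.
pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder image URL
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(out[0]["generated_text"])
```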
Qwen2 VL 2B Instruct GGUF
Apache-2.0
Qwen2-VL-2B-Instruct is a multimodal vision-language model that supports interaction between images and text, suitable for image understanding and text generation tasks.
Image-to-Text English
gaianet
95
1
Qwen2 VL 7B Instruct GGUF
Apache-2.0
Qwen2-VL-7B-Instruct is a multimodal vision-language model that supports joint understanding and generation tasks for images and text.
Image-to-Text English
second-state
195
4
Minivla Libero90 Prismatic
MIT
MiniVLA is a 1-billion-parameter vision-language-action model compatible with the Prismatic Vision-Language Model codebase, suitable for robotics and multimodal tasks.
Image-to-Text
Transformers English

Stanford-ILIAD
127
0
P MoD LLaVA NeXT 7B
Apache-2.0
p-MoD is a Mixture-of-Depths multimodal large language model built with a progressive ratio decay strategy, supporting image-text-to-text generation tasks.
Image-to-Text
MCG-NJU
74
4
Paligemma2 10b Pt 224
PaliGemma 2 is a vision-language model (VLM) that builds on the capabilities of the Gemma 2 model. It can process both image and text inputs simultaneously and generate text outputs, supporting multiple languages. It is suitable for various vision-language tasks such as image and short video captioning, visual question answering, text reading, object detection, and object segmentation.
Image-to-Text
Transformers

google
3,362
8
Paligemma2 10b Mix 224
PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text input to generate text output, suitable for various vision-language tasks.
Image-to-Text
Transformers

google
701
7
Xgen Mm Phi3 Mini Instruct Interleave R V1.5
Apache-2.0
xGen-MM is a series of large multimodal models (LMMs) developed by Salesforce AI Research, building on the proven design of the BLIP series with foundational enhancements for a more robust model base.
Image-to-Text
Safetensors English
Salesforce
7,373
51
Llava MORE Llama 3 1 8B Finetuning
Apache-2.0
LLaVA-MORE is an enhanced version of the LLaVA architecture that integrates LLaMA 3.1 as its language model, focusing on image-to-text tasks.
Image-to-Text
Transformers

aimagelab
215
9
Llama 3.1 8B Vision 378
This project adds visual capabilities to Llama-3.1-8B-Instruct by training a projection module over SigLIP image features.
Image-to-Text
Transformers

qresearch
203
35
Florence 2 Large Ft
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle a wide range of vision and vision-language tasks.
Image-to-Text
Transformers

microsoft
269.44k
349
Florence 2 Large
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle a wide range of vision and vision-language tasks.
Image-to-Text
Transformers

microsoft
579.23k
1,530
Paligemma 3b Mix 224
PaliGemma is a versatile, lightweight vision-language model (VLM) built upon the SigLIP vision model and Gemma language model, supporting image and text inputs with text outputs.
Image-to-Text
Transformers

google
143.03k
75
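PaliGemma's mix checkpoints are designed to be prompted directly with short task strings such as "caption en". The sketch below shows the direct processor-plus-model route in transformers; the image URL is a placeholder, and access to the checkpoint may require accepting its license on the Hub.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Direct processor + model usage (a sketch; the prompt and image URL are placeholders).
model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)
inputs = processor(text="caption en", images=image, return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=30, do_sample=False)
# Drop the prompt tokens before decoding so only the generated caption remains.
caption = processor.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(caption)
```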
Llava Llama 3 8b V1 1 Q3 K S GGUF
This model is a GGUF-format conversion of xtuner/llava-llama-3-8b-v1_1, supporting multimodal processing of images and text.
Image-to-Text
djward888
17
1
Llava Llama 3 8b V1 1 Q5 K M GGUF
This model is a GGUF-format version of xtuner/llava-llama-3-8b-v1_1 for the llama.cpp framework, supporting image-text-to-text tasks.
Image-to-Text
djward888
20
2
Llava Llama 3 8b V1 1 Q4 K M GGUF
This model is a GGUF-format conversion of xtuner/llava-llama-3-8b-v1_1, supporting multimodal interaction between images and text.
Image-to-Text
RaincloudAi
51
1
Moai 7B
MIT
MoAI is a large-scale language and vision hybrid model capable of processing both image and text inputs to generate text outputs.
Image-to-Text
Transformers

BK-Lee
183
45
Llava V1.6 Vicuna 7b Gguf
Apache-2.0
LLaVA is an open-source multimodal chatbot trained by fine-tuning an LLM on multimodal instruction-following data. This release is the GGUF-quantized version and offers multiple quantization options.
Image-to-Text
cjpais
493
5
Llava V1.5 7b Gguf
LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
Image-to-Text
granddad
13
0
Image Captioning With Blip
BSD-3-Clause
BLIP is a unified vision-language pretraining framework that excels at tasks such as image caption generation, supporting both conditional and unconditional text generation.
Image-to-Text
Transformers

Vidensogende
16
0
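BLIP-style captioners are commonly exercised through the long-standing image-to-text pipeline in transformers. A minimal sketch follows; it uses the widely published Salesforce/blip-image-captioning-base checkpoint for illustration rather than this specific repository, and the image URL is a placeholder.

```python
from transformers import pipeline

# Unconditional caption generation with a BLIP checkpoint via the image-to-text pipeline.
# The checkpoint id and image URL are illustrative placeholders.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

result = captioner("https://example.com/photo.jpg", max_new_tokens=30)
print(result[0]["generated_text"])
```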
Idefics 80b
Other
IDEFICS-80B is an 80-billion-parameter multimodal model capable of processing both image and text inputs to generate text outputs. It is an open-source reproduction of DeepMind's Flamingo model.
Image-to-Text
Transformers English

HuggingFaceM4
70
70